Perhaps the most famous example of a recommender engine in the data science world is the Netflix Prize competition, launched in 2006, in which teams from around the world competed to improve on Netflix's recommendation algorithm. The final prize of $1,000,000 was awarded to a team whose solution improved on the accuracy of Netflix's own algorithm by about 10%. The competition also spurred the development of new techniques that are still in use today. For more on this topic, see Simon Funk's blog post.
In this exercise, you will build a collaborative-filtering recommendation engine using both a cosine similarity approach and SVD (singular value decomposition). Before proceeding, download the MovieLens dataset.
In [1]:
import numpy as np
import pandas as pd
In [2]:
df = pd.read_csv('./data/ml-100k/u.data', sep='\t', header=None)
df.columns = ['userid', 'itemid', 'rating', 'timestamp']
df['timestamp'] = pd.to_datetime(df['timestamp'],unit='s')
df.head()
Out[2]:
Before building any recommendation engines, we'll have to get the data into a useful form. Do this by first splitting the data into training and test sets, and then by constructing two new dataframes whose columns are the unique movies and whose rows are the unique users, filling in 0 for missing values.
In [3]:
len(df['userid'].unique()), len(df['itemid'].unique())
Out[3]:
In [4]:
from sklearn.model_selection import train_test_split
dftrain, dftest = train_test_split(df, test_size=0.2)
In [5]:
len(dftrain['userid'].unique()), len(dftrain['itemid'].unique())
Out[5]:
In [6]:
missing_items = set(df['itemid'].unique()) - set(dftrain['itemid'].unique())
items = []
for item in missing_items:
    items.append([1, item, 0, np.nan])
# DataFrame.append was removed in pandas 2.0; pd.concat is the modern equivalent
dftrain_filled = pd.concat([dftrain, pd.DataFrame(items, columns=dftrain.columns)]).reset_index()
In [7]:
len(dftrain_filled['userid'].unique()), len(dftrain_filled['itemid'].unique())
Out[7]:
In [8]:
len(dftest['userid'].unique()), len(dftest['itemid'].unique())
Out[8]:
In [9]:
missing_items = set(df['itemid'].unique()) - set(dftest['itemid'].unique())
items = []
for item in missing_items:
    items.append([1, item, 0, np.nan])
# DataFrame.append was removed in pandas 2.0; pd.concat is the modern equivalent
dftest_filled = pd.concat([dftest, pd.DataFrame(items, columns=dftest.columns)]).reset_index()
In [10]:
len(dftest_filled['userid'].unique()), len(dftest_filled['itemid'].unique())
Out[10]:
In [11]:
# X = pd.pivot_table(df, values='rating', index='userid', columns='itemid').fillna(0)
# X.head()
Xtrain = pd.pivot_table(dftrain_filled, values='rating', index='userid', columns='itemid').fillna(0)
Xtest = pd.pivot_table(dftest_filled, values='rating', index='userid', columns='itemid').fillna(0)
Xtrain.head()
Out[11]:
In [12]:
print(Xtrain.shape)
print(Xtest.shape)
Now split the data into training and test sets, using an 80/20 train/test ratio. (Note: the split was already performed above, before pivoting, so that items missing from each split could be filled in; the commented code below shows the alternative of splitting after pivoting.)
In [13]:
# from sklearn.model_selection import train_test_split
# Xtrain, Xtest = train_test_split(X, test_size=0.2)
Building a recommendation engine can be thought of as "filling in the holes" in the sparse matrix you made above. For example, take a look at the MovieLens data: the matrix is mostly zeros. Our task is to predict what rating a given user will give a given movie, based on that user's tastes. To determine a user's taste, we can use cosine similarity, which is given by $$s_u^{cos}(u_k,u_a) = \frac{ u_k \cdot u_a }{ \left \| u_k \right \| \left \| u_a \right \| } = \frac{ \sum x_{k,m}x_{a,m} }{ \sqrt{\sum x_{k,m}^2\sum x_{a,m}^2} }$$ for users $u_k$ and $u_a$ with ratings $x_{k,m}$ and $x_{a,m}$. This is just the cosine of the angle between the two rating vectors. Likewise, the similarity between two items, $i_m$ and $i_b$, is given by $$s_u^{cos}(i_m,i_b) = \frac{ i_m \cdot i_b }{ \left \| i_m \right \| \left \| i_b \right \| } = \frac{ \sum x_{a,m} x_{a,b} }{ \sqrt{ \sum x_{a,m}^2 \sum x_{a,b}^2 } }$$
The predicted rating of user $u_k$ for item $m$ is then given for user-based filtering by $$\hat{x}_{k,m} = \bar{x}_{k} + \frac{\sum\limits_{u_a} s_u^{cos}(u_k,u_a) (x_{a,m})}{\sum\limits_{u_a}|s_u^{cos}(u_k,u_a)|}$$ and for item-based filtering by $$\hat{x}_{k,m} = \frac{\sum\limits_{i_b} s_u^{cos}(i_m,i_b) (x_{k,b}) }{\sum\limits_{i_b}|s_u^{cos}(i_m,i_b)|}$$
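To make the formulas concrete, here is a minimal sketch with made-up ratings: it computes the cosine similarity between a target user and two neighbors, then forms the similarity-weighted average prediction for an unrated movie (omitting the mean-rating offset $\bar{x}_k$ for simplicity). All numbers here are purely illustrative.

```python
import numpy as np

# Hypothetical rating vectors over five movies (0 = unrated)
u_k = np.array([5., 3., 0., 4., 0.])  # target user; movie index 2 is unrated
u_a = np.array([4., 0., 0., 5., 1.])
u_b = np.array([2., 3., 5., 4., 0.])

def cos_sim(x, y):
    # Cosine of the angle between two rating vectors
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Similarity-weighted average of the neighbors' ratings for movie 2
neighbors = [u_a, u_b]
sims = np.array([cos_sim(u_k, u) for u in neighbors])
ratings = np.array([u[2] for u in neighbors])
pred = np.sum(sims * ratings) / np.sum(np.abs(sims))
```

Note that a neighbor who hasn't rated the movie (here `u_a`) contributes 0 to the numerator but still appears in the normalizing denominator, which pulls the prediction toward zero; the class below inherits the same behavior.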
Use these ideas to construct a class cos_engine which can be used to recommend movies for a given user. Be sure to also test your algorithm, reporting its accuracy. Bonus: Use adjusted cosine similarity.
In [14]:
class cos_engine():
    """k-nearest-neighbor collaborative filter using cosine similarity."""
    def __init__(self, k, how='user'):
        self.k = k      # number of nearest neighbors to use
        self.how = how  # 'user' or 'item' based filtering
    def _cos_sim(self, x, y):
        # Cosine of the angle between two rating vectors
        num = np.sum(x * y)
        den = np.sum(x**2) * np.sum(y**2)
        return num / np.sqrt(den)
    def _user_based_filtering(self, user, X):
        # Similarity of `user` to every other user; keep the top k
        sim = X.apply(lambda row: self._cos_sim(X.loc[user, :].values, row.values), axis=1)
        sim = sim.drop(user, axis=0)
        rec = sim.sort_values(ascending=False).head(self.k)
        return rec
    def _item_based_filtering(self, item, X):
        # Similarity of `item` to every other item; keep the top k
        Xt = X.T
        sim = Xt.apply(lambda row: self._cos_sim(Xt.loc[item, :].values, row.values), axis=1)
        sim = sim.drop(item, axis=0)
        rec = sim.sort_values(ascending=False).head(self.k)
        return rec
    def fit(self, X):
        if self.how == 'user':
            # Fill each user's row with a similarity-weighted average of
            # the k most similar users' ratings
            Xfilled = []
            for user in X.index.values:
                ubf = self._user_based_filtering(user, X)
                weighted_sum = np.sum(X.loc[ubf.index, :].values * ubf.values.reshape(-1, 1), axis=0)
                Xfilled.append(weighted_sum / np.sum(np.abs(ubf.values)))
            self.Xfilled = pd.DataFrame(np.array(Xfilled), index=X.index, columns=X.columns)
        elif self.how == 'item':
            # Same idea, but averaging over the k most similar items
            Xtfilled = []
            for item in X.columns.values:
                ibf = self._item_based_filtering(item, X)
                Xt = X.T
                weighted_sum = np.sum(Xt.loc[ibf.index, :].values * ibf.values.reshape(-1, 1), axis=0)
                Xtfilled.append(weighted_sum / np.sum(np.abs(ibf.values)))
            self.Xfilled = pd.DataFrame(np.array(Xtfilled).T, index=X.index, columns=X.columns)
        else:
            print('Use either "user" or "item" for the parameter "how".')
    def predict(self, user, X):
        # Return predicted ratings for items the user has not yet rated
        try:
            ratings = self.Xfilled.loc[user, :]
            rated_items = X.loc[user, :] > 0
            ratings = ratings[~rated_items]
            return ratings.sort_values(ascending=False)
        except AttributeError:
            print('You have to fit the recommender first!')
    def score(self, X):
        # Mean absolute error over the observed (nonzero) entries of X
        try:
            return np.sum(np.sum(np.abs(self.Xfilled - X.where(X > 0)))) / X.where(X > 0).count().sum()
        except AttributeError:
            print('You have to fit the recommender first!')
In [15]:
rec_ub = cos_engine(k=7)
rec_ub.fit(Xtrain)
In [16]:
rec_ub.Xfilled.loc[100, 258]
Out[16]:
In [19]:
rec_ub.predict(100, Xtrain + Xtest).head(10)
Out[19]:
In [20]:
rec_ub.score(Xtest)
Out[20]:
In [21]:
rec_ib = cos_engine(k=7, how='item')
rec_ib.fit(Xtrain)
In [22]:
rec_ib.Xfilled.loc[100, 258]
Out[22]:
In [24]:
rec_ib.predict(100, Xtrain + Xtest).head(10)
Out[24]:
In [25]:
rec_ib.score(Xtest)
Out[25]:
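The bonus above asks for adjusted cosine similarity, which subtracts each user's mean rating before comparing items, so that habitually generous and habitually harsh raters become comparable. Here is a minimal sketch on a tiny made-up matrix (names and numbers are purely illustrative, not part of the MovieLens data):

```python
import numpy as np
import pandas as pd

# Toy user-by-item ratings matrix (0 = unrated)
X = pd.DataFrame([[5, 3, 4],
                  [4, 0, 5],
                  [1, 5, 0]],
                 index=['u1', 'u2', 'u3'], columns=['i1', 'i2', 'i3'])

# Subtract each user's mean over their *rated* items, then put 0 back
# in the unrated cells so they do not contribute to the dot products
means = X.where(X > 0).mean(axis=1)
Xc = X.where(X > 0).sub(means, axis=0).fillna(0)

def adj_cos_sim(a, b):
    # Plain cosine on the mean-centered vectors = adjusted cosine
    num = np.sum(a * b)
    den = np.sqrt(np.sum(a**2) * np.sum(b**2))
    return num / den if den else 0.0

sim_i1_i3 = adj_cos_sim(Xc['i1'].values, Xc['i3'].values)
```

Dropping `_cos_sim` in favor of `adj_cos_sim` applied to a mean-centered copy of `X` is one way to wire this into `cos_engine`.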
In [11]:
# class cos_engine():
# def __init__(self, k, how='user'):
# self.k = k
# self.how = how
# def _cos_sim(self, x, y):
# num = np.sum(x * y)
# den = np.sum(x**2) * np.sum(y**2)
# return num / np.sqrt(den)
# def _user_based_filtering(self, user, X):
# sim = X.apply(lambda row: self._cos_sim(X.loc[user, :].values, row.values), axis=1)
# sim = sim.drop(user, axis=0)
# rec = sim.sort_values(ascending=False)#.head(self.k)
# return rec
# def _item_based_filtering(self, item, X):
# Xt = X.T
# sim = Xt.apply(lambda row: self._cos_sim(Xt.loc[item, :].values, row.values), axis=1)
# sim = sim.drop(item, axis=0)
# rec = sim.sort_values(ascending=False)#.head(self.k)
# return rec
# def fit(self, user, X):
# if self.how == 'user':
# ubf = self._user_based_filtering(user, X)
# rec = {}
# for item in X.columns.values:
# temp = X.loc[ubf.index, item]
# rec[item] = ubf.loc[temp > 0].head(self.k)
# self.rec = rec
# elif self.how == 'item':
# rec = {}
# for item in X.columns.values:
# print('Fitting item ', item) # This is pretty slow...
# ibf = self._item_based_filtering(item, X)
# Xt = X.T
# temp = Xt.loc[ibf.index, user]
# rec[item] = ibf.loc[temp > 0].head(self.k)
# self.rec = rec
# else:
# print('Use either "user" or "item" for the parameter "how".')
# def _predict_one(self, user, item, X):
# try:
# if self.how == 'user':
# idx = self.rec[item].index
# if len(idx) == 0:
# return X.loc[user, item]
# weighted_sum = np.sum(X.loc[idx, item].values * self.rec[item].values, axis=0)
# return weighted_sum / np.sum(self.rec[item].values, axis=0)
# elif self.how == 'item':
# Xt = X.T
# idx = self.rec[item].index
# if len(idx) == 0:
# return X.loc[user, item]
# weighted_sum = np.sum(Xt.loc[idx, user].values * self.rec[item].values, axis=0)
# return weighted_sum / np.sum(self.rec[item].values, axis=0)
# except AttributeError:
# print('You have to fit the recommender first!')
# def predict(self, user, X):
# pred = []
# for item in X.columns.values:
# pred.append(self._predict_one(user, item, X))
# return pd.Series(np.array(np.round(pred), dtype=int), index=X.columns)
# def score(self, user, X):
# try:
# pred = self.predict(user, X)
# rated_items = X.loc[user, :] > 0
# user_ratings = X.loc[user, rated_items]
# pred_ratings = pred.loc[rated_items]
# return np.sum(user_ratings == pred_ratings) / len(user_ratings)
# except AttributeError:
# print('You have to fit the recommender first!')
In [19]:
# solution
# Libraries
from sklearn.metrics.pairwise import cosine_similarity
np.abs(cosine_similarity(Xtrain)).sum(axis=1).reshape(943, 1).shape
Out[19]:
Above we used cosine similarity to fill the holes in our sparse matrix. Another, and much more popular, method for matrix completion is the Singular Value Decomposition (SVD). SVD factors our data matrix into three matrices, given by $$\textbf{M} = \textbf{U} \bf{\Sigma} \textbf{V}^*$$ where $\textbf{M}$ is our data matrix, $\textbf{U}$ is a unitary matrix containing the latent variables in the user space, $\bf{\Sigma}$ is a diagonal matrix containing the singular values of $\textbf{M}$, and $\textbf{V}$ is a unitary matrix containing the latent variables in the item space. For more information on the SVD, see the Wikipedia article.
NumPy provides a routine, np.linalg.svd, for computing the SVD of a matrix (for large sparse matrices, scipy.sparse.linalg.svds is the usual alternative). By truncating the matrices $\textbf{U}$, $\bf{\Sigma}$, and $\textbf{V}$ to the top $k$ singular values and then multiplying them back together, we can reconstruct a low-rank estimate of the matrix $\textbf{M}$ with all the holes filled in.
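A minimal sketch of this truncate-and-reconstruct idea on a tiny made-up ratings matrix (the numbers are purely illustrative):

```python
import numpy as np

# Toy ratings matrix; in practice this is the user-by-item pivot table
M = np.array([[5., 3., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.],
              [1., 0., 0., 4.],
              [0., 1., 5., 4.]])

# Full SVD, then keep only the k largest singular values/vectors
U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
M_hat = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# M_hat has the same shape as M, but the zero "holes" now carry
# low-rank estimates instead of exact zeros
```

With `k` equal to the full rank the reconstruction recovers `M` exactly; shrinking `k` trades exactness for generalization, which is the parameter tuned in the bonus below.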
Use these ideas to construct a class svd_engine which can be used to recommend movies for a given user. Be sure to also test your algorithm, reporting its accuracy. Bonus: Tune any parameters.
In [26]:
class svd_engine():
    """Recommender that fills the ratings matrix with a rank-k SVD estimate."""
    def __init__(self, k):
        self.k = k  # number of singular values to keep
    def fit(self, X):
        if self.k <= np.min(X.shape):
            # Truncate the SVD to rank k, splitting sqrt(S) between the factors
            U, S, V = np.linalg.svd(X)
            Uk = np.dot(U[:, :self.k], np.sqrt(np.diag(S[:self.k])))
            Vk = np.dot(np.sqrt(np.diag(S[:self.k])), V[:self.k, :])
            self.Xfilled = pd.DataFrame(np.dot(Uk, Vk), index=X.index, columns=X.columns)
        else:
            print('k has to be smaller than the minimum dimension of X')
    def predict(self, user, X):
        # Return predicted ratings for items the user has not yet rated
        try:
            ratings = self.Xfilled.loc[user, :]
            rated_items = X.loc[user, :] > 0
            ratings = ratings[~rated_items]
            return ratings.sort_values(ascending=False)
        except AttributeError:
            print('You have to fit the recommender first!')
    def score(self, X):
        # Mean absolute error over the observed (nonzero) entries of X
        try:
            return np.sum(np.sum(np.abs(self.Xfilled - X.where(X > 0)))) / X.where(X > 0).count().sum()
        except AttributeError:
            print('You have to fit the recommender first!')
In [27]:
rec_svd = svd_engine(k=100)
rec_svd.fit(Xtrain)
In [28]:
rec_svd.Xfilled.loc[100, 258]
Out[28]:
In [30]:
rec_svd.predict(100, Xtrain + Xtest).head(10)
Out[30]:
In [31]:
rec_svd.score(Xtest)
Out[31]:
In [32]:
scores = []
for k in np.arange(10, 950, 10):
rec_svd = svd_engine(k=k)
rec_svd.fit(Xtrain)
scores.append([k, rec_svd.score(Xtest)])
In [35]:
pd.DataFrame(scores, columns=['k', 'MAE']).sort_values(by='MAE').head()
Out[35]:
In [36]:
scores = []
for k in np.arange(2, 11):
rec_svd = svd_engine(k=k)
rec_svd.fit(Xtrain)
scores.append([k, rec_svd.score(Xtest)])
In [37]:
pd.DataFrame(scores, columns=['k', 'MAE']).sort_values(by='MAE')
Out[37]:
In [42]:
rec_svd = svd_engine(k=10)
rec_svd.fit(Xtrain)
In [43]:
rec_svd.Xfilled.loc[100, 258]
Out[43]:
In [44]:
rec_svd.predict(100, Xtrain + Xtest).head(10)
Out[44]:
In [45]:
rec_svd.score(Xtest)
Out[45]: